The identification of variable-length, equifrequent character strings in a natural language data base

Authors

  • A. C. Clare
  • E. M. Cook
  • Michael F. Lynch
Abstract

s Service. This is issued biweekly, and includes the titles, authors' names, and bibliographic references of currently published articles of chemical interest. The issue used was No. 1, 1971, dated 11 January. A typical entry from the issue is shown in Fig. 1; the bibliographic reference is given as the ASTM Coden. The titles are recorded in upper-case characters. An occasional artefact arises through the insertion of additional space symbols: the printed publication includes a KWIC (Key Word In Context) index, and the spaces ensure that certain chemical word stems, such as QUINONE in Fig. 1 (the word is normally written as PHYLLOQUINONE), are indexed.

A set of simple programs (written in PLAN, the ICL 1900 series assembly language) was devised to produce counts of n-grams (i.e., strings of 1, 2, 3 and 5 characters), including the space character, for values of n between 1 and 5. The program to count single-character occurrences used the binary value of the character code to address a position in a 62-word array. The digrams were counted by using a two-dimensional array (62 x 62 = 3844). Longer n-grams (n = 3 and 5) were created by taking a window equal to that number of characters and moving it along the title record, creating a new record at each position (a space was inserted as the initial character of each title). The records were written to tape, and subsequently sorted, counted and printed.
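The counting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' PLAN programs: the function names `count_ngrams` and `digram_table`, the `alphabet` parameter, and the in-memory sort (standing in for the tape sort) are all hypothetical. Whether the leading space was also inserted for the digram counts is not stated in the text; the sketch assumes it was.

```python
from itertools import groupby


def count_ngrams(titles, n):
    """Count n-grams by sliding a window of n characters along each title.

    Mirrors the described approach: one record per window position,
    then sort the records and count runs of identical records.
    """
    records = []
    for title in titles:
        text = " " + title  # a space is inserted as the initial character
        for i in range(len(text) - n + 1):
            records.append(text[i:i + n])
    records.sort()  # stands in for the tape sort (assumption)
    return {gram: sum(1 for _ in run) for gram, run in groupby(records)}


def digram_table(titles, alphabet):
    """Count digrams in a two-dimensional array indexed by character position.

    `alphabet` is a hypothetical stand-in for the 62-symbol character set;
    every character in the titles is assumed to appear in it.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    table = [[0] * len(alphabet) for _ in alphabet]
    for title in titles:
        text = " " + title  # leading space assumed here as well
        for first, second in zip(text, text[1:]):
            table[index[first]][index[second]] += 1
    return table


if __name__ == "__main__":
    sample = ["VITAMIN K", "PHYLLO QUINONE"]  # made-up titles for illustration
    for gram, count in sorted(count_ngrams(sample, 3).items()):
        print(repr(gram), count)
```

The array-based table suits short n-grams, where the full set of combinations is small (62 x 62 = 3844 cells for digrams); for longer n-grams the sort-and-count route avoids allocating a cell for every possible string.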


Similar resources

Human and Machine Understanding Of Natural Language Character Strings

There is a great deal of variability in the way in which different language users understand a given natural language (NL) character string. This variability probably arises because of some combination of differences in language users’ perceptions of its context-of-use (pragmatics), identity and mode of organization of its meaning bearing parts (syntax), and in the meanings assigned to those pa...


Character sets of strings

Given a string S over a finite alphabet Σ, the character set (also called the fingerprint) of a substring S′ of S is the subset C ⊆ Σ of the symbols occurring in S′. The study of the character sets of all the substrings of a given string (or a given collection of strings) appears in several domains such as rule induction for natural language processing or comparative genomics. Several queries a...


Finding and Identifying Text in 900+ Languages

This paper presents a trainable open-source utility to extract text from arbitrary data files and disk images which uses language models to automatically detect character encodings prior to extracting strings and for automatic language identification and filtering of non-textual strings after extraction. With a test set containing 923 languages, consisting of strings of at most 65 characters, a...


Occurrence of morphologic variability in tick Hyalomma anatolicum anatolicum (Acari: Ixodidae)

BACKGROUND: Taxonomy and identification of the ticks in the genus Hyalomma, the most significant vectors of animal and human pathogen agents, have always been debatable. Scientists believe that variation within the taxa of the genus Hyalomma is the most important factor which causes misidentification. OBJECTIVES: The purpose of this study is to identify valuable characters for male H. anatolicu...


Variations in EFL Teachers’ Pedagogical Knowledge Base as a Function of Their Teaching License Status

The study of teachers’ pedagogical knowledge base (PKB) to discover how teachers think and work is attracting increasing attention in ELT. Against this background, the present study aimed at probing the likely variations in EFL teachers’ pedagogical knowledge base as a function of their teaching license status. To this aim, six teachers (two standard-licensed, two alternatively-licensed, and tw...



Journal title:
  • Comput. J.

Volume 15  Issue 

Pages  -

Publication date 1972